An Overview of Building a Scalable, Production-Ready Sales Forecasting Framework

Retail Sales Forecasting

The task is to build a retail sales forecasting model using data provided by Corporación Favorita, a grocery chain in Ecuador. The data was released for a Kaggle competition. I was given this task by Coles as part of my interview process for a Data Scientist position.

Below is the road map I followed to complete the task:

  1. Brief Info about Corporación Favorita Grocery

  2. Exploratory Data Analysis

  3. Feature Engineering and Selection

  4. Model Training and Selection

  5. Preparing/Deploying Model for Sales Forecasting

  6. Future Work

Corporación Favorita Grocery

Corporación Favorita is a publicly listed grocery store chain, traded under the ticker SLF and based in Quito, Pichincha, Ecuador.

My Exposure to Ecuador

  1. Julian Assange
  2. Corporación Favorita Grocery

The data was provided by this retail giant to the Kaggle community.

Let me have a crack at this data.

Exploratory Data Analysis

As a first step, I look at the data tables provided. Below are the questions I want to answer by inspecting the various tables.

  1. What is the structure of the data?
  2. Is the data nicely distributed?
  3. Has any data governance been applied?
  4. Can we trust the data and move on to the next step, feature engineering?
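The first two questions can be answered mechanically for every table. A minimal sketch (the helper name and toy data are my own, not from the competition files):

```python
import pandas as pd

def quick_profile(df):
    """Summarise shape, dtypes and missing-value counts for a table."""
    return {
        'shape': df.shape,
        'dtypes': df.dtypes.astype(str).to_dict(),
        'missing': df.isna().sum().to_dict(),
    }

# Toy frame standing in for any of the competition tables.
toy = pd.DataFrame({'store_nbr': [1, 2], 'transactions': [100.0, None]})
profile = quick_profile(toy)
```

Running the same helper over every CSV gives a one-glance sanity check before any modelling.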

There are various Kaggle kernels where people have performed EDA on this dataset. Shopping for Insights - Favorita EDA looks like the most comprehensive among them. That kernel is written in tidyverse R with ggplot2.

I will be using Plotly Express for my analysis. It is a relatively new interactive plotting tool for Python and is part of the Plotly family.

In [1]:
import pandas as pd
import plotly.express as px
import os
import glob
In [2]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
In [3]:
# input data directory
# I downloaded the data from the Kaggle competition and unzipped it with the 7z tool in an Ubuntu terminal
# TODO: use the Kaggle API to access the data
data_dir = '/home/ibbu/PycharmProjects/retail_sales_prediction/data/'
In [4]:
files = [f for f in os.listdir(data_dir) if f.endswith('.csv')]
In [5]:
files
Out[5]:
['stores.csv',
 'items.csv',
 'train.csv',
 'transactions.csv',
 'test.csv',
 'oil.csv',
 'sample_submission.csv',
 'holidays_events.csv']
In [6]:
# Quickly see the information about the files and sizes to be read
!ls  -la ../data/
total 5044728
drwxrwxr-x  3 ibbu ibbu       4096 Sep  1 22:08 .
drwxr-xr-x 13 ibbu ibbu       4096 Sep  4 12:48 ..
drwxr-xr-x  2 ibbu ibbu       4096 Aug 30 20:07 favorita-grocery-sales-forecasting
-rw-r--r--  1 ibbu ibbu      22309 Oct 20  2017 holidays_events.csv
-rw-r--r--  1 ibbu ibbu     101841 Oct 20  2017 items.csv
-rw-rw-r--  1 ibbu ibbu        846 Sep  4 11:21 logs.log
-rw-r--r--  1 ibbu ibbu      20580 Oct 20  2017 oil.csv
-rw-r--r--  1 ibbu ibbu   40445582 Oct 20  2017 sample_submission.csv
-rw-r--r--  1 ibbu ibbu       1387 Oct 20  2017 stores.csv
-rw-r--r--  1 ibbu ibbu  126163026 Oct 20  2017 test.csv
-rw-r--r--  1 ibbu ibbu 4997452288 Oct 20  2017 train.csv
-rw-r--r--  1 ibbu ibbu    1552637 Oct 20  2017 transactions.csv

No surprises: the training data is by far the largest file, at roughly 5 GB.
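The same size check can be done without leaving Python. A small sketch (the function name is my own) using only the standard library:

```python
import os

def list_csv_files(data_dir):
    """Return (file_name, size_in_bytes) for each .csv file, sorted by name."""
    return sorted(
        (f, os.path.getsize(os.path.join(data_dir, f)))
        for f in os.listdir(data_dir)
        if f.endswith('.csv')
    )

# for name, size in list_csv_files(data_dir):
#     print(f'{name}: {size / 1e6:.1f} MB')
```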

In [7]:
class ReadDataStore:
    """Read tables from the data store.

    In our case the data store is local and every table is a .csv file.
    """

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def _read_csv(self, file_name):
        return pd.read_csv(os.path.join(self.data_dir, file_name))

    # One thin wrapper per table keeps the call sites explicit and leaves
    # room to add table-specific parsing (dtypes, date columns) later.
    def read_stores(self, file_name):
        return self._read_csv(file_name)

    def read_items(self, file_name):
        return self._read_csv(file_name)

    def read_transactions(self, file_name):
        return self._read_csv(file_name)

    def read_oil(self, file_name):
        return self._read_csv(file_name)

    def read_holidays(self, file_name):
        return self._read_csv(file_name)

    def read_train(self, file_name):
        return self._read_csv(file_name)

    def read_test(self, file_name):
        return self._read_csv(file_name)


readDataStore = ReadDataStore(data_dir)

Store Data

In [8]:
df_store = readDataStore.read_stores('stores.csv')
In [9]:
df_store.shape
Out[9]:
(54, 5)
In [10]:
df_store.describe()
Out[10]:
store_nbr cluster
count 54.000000 54.000000
mean 27.500000 8.481481
std 15.732133 4.693395
min 1.000000 1.000000
25% 14.250000 4.000000
50% 27.500000 8.500000
75% 40.750000 13.000000
max 54.000000 17.000000
In [11]:
df_store_count = df_store.groupby(['state', 'city', 'type', 'cluster'])['store_nbr'].count()

df_store_count = df_store_count.reset_index().rename(columns = {'store_nbr':'count'})

px.bar(df_store_count, x = 'state', y= 'count', color = 'city', facet_col='type')
In [12]:
px.bar(df_store_count, x = 'cluster', y= 'count', color = 'type')

There is an interesting observation: cluster 10 contains three types of stores, while every other cluster has only one kind. Let's dig deeper into what is special about this cluster.

In [13]:
df_store_count[df_store_count['cluster']==10]
Out[13]:
state city type cluster count
7 Esmeraldas Esmeraldas E 10 1
13 Guayas Guayaquil D 10 1
14 Guayas Guayaquil E 10 2
15 Guayas Libertad E 10 1
19 Los Rios Babahoyo B 10 1

Items Data

In [14]:
df_items = readDataStore.read_items('items.csv')
In [15]:
df_items.shape
Out[15]:
(4100, 4)
In [16]:
df_items_count = df_items.groupby(['family', 'class', 'perishable'])['item_nbr'].count()
df_items_count = df_items_count.reset_index().rename(columns = {'item_nbr':'count'})
In [17]:
df_items_count.head()
Out[17]:
family class perishable count
0 AUTOMOTIVE 6806 0 5
1 AUTOMOTIVE 6808 0 2
2 AUTOMOTIVE 6810 0 6
3 AUTOMOTIVE 6824 0 2
4 AUTOMOTIVE 6848 0 5
In [18]:
px.bar(df_items_count, x = 'family', y= 'count', color = 'class',facet_col='perishable')
In [19]:
px.bar(df_items_count, x = 'family', y= 'count', color = 'perishable')

Transactions

In [21]:
df_transactions = readDataStore.read_transactions('transactions.csv')
In [22]:
df_transactions.count()
Out[22]:
date            83488
store_nbr       83488
transactions    83488
dtype: int64

No missing values. Date and store number together form the unique key of this table.
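That uniqueness claim is easy to verify with `DataFrame.duplicated`. A sketch on a toy frame mirroring the transactions schema (values invented):

```python
import pandas as pd

toy = pd.DataFrame({
    'date': ['2013-01-01', '2013-01-02', '2013-01-02'],
    'store_nbr': [25, 1, 2],
    'transactions': [770, 2111, 2358],
})

# (date, store_nbr) should uniquely identify every row; any duplicate
# pair would show up as a nonzero count here.
dupes = toy.duplicated(subset=['date', 'store_nbr']).sum()
```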

In [23]:
import datetime
import calendar
In [24]:
df_transactions.head()
Out[24]:
date store_nbr transactions
0 2013-01-01 25 770
1 2013-01-02 1 2111
2 2013-01-02 2 2358
3 2013-01-02 3 3487
4 2013-01-02 4 1922
In [25]:
df_transactions['date'] = pd.to_datetime(df_transactions['date'])
In [26]:
df_transactions['day_of_week'] = df_transactions['date'].dt.day_name()
In [27]:
set(df_transactions['day_of_week'])
Out[27]:
{'Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday'}
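By default these names sort alphabetically, which scrambles any day-of-week axis. An ordered `pd.Categorical` restores calendar order (a sketch of the idea, not from the original notebook):

```python
import pandas as pd

WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

s = pd.Series(['Sunday', 'Monday', 'Friday'])
ordered = pd.Series(pd.Categorical(s, categories=WEEKDAYS, ordered=True))

# Sorting now follows calendar order instead of alphabetical order.
result = list(ordered.sort_values())
```

Plotly Express offers the same effect per figure via its `category_orders` argument.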
In [28]:
# fig = px.bar(df_transactions, x="date", y="transactions", color='store_nbr', barmode="group", facet_row="day_of_week",
#              height=2000,
#              category_orders={"day_of_week": ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']})
# fig.show()
In [29]:
px.violin(df_transactions, y="transactions", x='day_of_week', box=True, points='all')